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Fast optimal CMB power spectrum estimation with Hamiltonian 
sampling 
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ABSTRACT 

We present a method for fast optimal estimation of the temperature angular power spectrum 
from observations of the cosmic microwave background. We employ a Hamiltonian Monte 
Carlo (HMC) sampler to obtain samples from the posterior probability distribution of all the 
power spectrum coefficients given a set of observations. We compare the properties of the 
HMC and the related Gibbs sampling approach on low-resolution simulations and find that 
the HMC method performs favourably even in the regime of relatively low signal-to-noise. We 
also demonstrate the method on high-resolution data by applying it to simulated WMAP data. 
Analysis of a WMAP-sized data set is possible in a around eighty hours on a high-end desktop 
computer HMC imposes few conditions on the distribution to be sampled and provides us 
with an extremely flexible approach upon which to build. 

Key words: cosmic microwave background - methods: data analysis - methods: statistical 



1 INTRODUCTION 

Observations of the cosmic microwave background (CMB) have 
proved to be extremely valuable for testing and constraining 
cosmological models. The majority of models predict that the 
anisotropies in the CMB signal are Gaussian and their statistics 
isotropic across the sky. The angular power spectrum Ce therefore 
provides a natural connection between theory and observation and a 
variety of methods have been explored to compute the power spec- 
trum from sets of observations. 

Maximum-likel i hood methods | Gorski 1 19941 : 

iBond. Jaffe & Knoxl 1 19981 : lOh. Spergel & Hinshawl 



199^ 



pro- 
vide an optimal estimate of the CMB power spectrum which 
has made them an invaluable tool for an alysing the CMB for 
single -dish experiments and interferometers dHobson & Maisinged 
l2002h . Brute force implementations of the method can only be 
applied to small data sets as the required computation scales as 
C9(A^piy), where A^pix is the number of pixels in a CMB map (see 
lEfstathioul ( l2003h for a review). For a number of special cases 
one can construc t maximum-l ikelihood esti mators that perform 
more favourably jChallinor et~ al. 2002; Wande lt & Hansenll2003h . 
although their lack of generality limits their applicability and, as 
their computational demands scale as 0{Npi^), even they cannot 
be applied directly the largest contemporary (WMAP) or future 
(Planck) data. 

Alternatively one ca n resort to approximate pseudo-C« meth- 
ods, iHivon et"^ J ( I2OO2I) . These scale as the map-making pro- 
cess and are fast eve n for the largest data sets. Hybrid meth- 
ods tefstathioul I2OO4I) combine a maximum-likelihood approach 
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on large angular scales with a fast pseudo-C^ estimator on small 
scales. 

To compare theoretically predicted power spectra and those 
estimated from a set of observations it is necessary to construct a 
likelihood function. Maximum-likelihood and pseudo-C^ methods 
can only provide approximations to this likelihood. 

An al ternative framework has b een devel- 
oped jWandelt. Larson & Laksminaravanal |2004| : 

Ijewell. Levin & Anderson 2004) where one explores the full 
posterior distribution of the power spectrum with Monte Carlo 
samples. This method is not only exact but scales like the pseudo- 
Ce methods. Under the assumption of position invariant, circularly 
symmetric beams and uncorrelated noise, one can perform the 
beam convolution in the spherical harmonic domain and evaluate 
the likelihood of the data in the map domain, and the method scales 
as C(Afp/f ). The favourable s caling has ena bled the method to 
be ap pUed to the WMAP data jBennett et al]|2003l : lEriksen et all 
12004 . 

The approach relies on the availability of an efficient method 
for sampling from high-dimensional distributions. Previous imple- 
mentations use a Gibbs sampler but this restricts the applicability 
of the method to Gaussian noise and CMB. We propose the us e of a 
Hamiltonian Monte Carlo (HMC) sampler dOuane et ai]|l987h . As 
opposed to the majority of Markov-Chain Monte Carlo (MCMC) 
methods, HMC scales well with problem size. Few requirements 
are made on the distribution to be sampled, thus giving us the op- 
portunity for great fle xibility. HM C has been widely applied in 
Bayesian computation jNeall 19931) and has also b een employed for 
cosmological parameter estimation ( lHaiianll2007l) . 

In this work we begin, in Section |2l by outlining the proce- 
dure for estimating power spectra with sampling. In Section [3] we 
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describe the HMC method and a technique for determining the con- 
vergence of samples drawn with a HMC sampler. A summary of the 
process of applying HMC to the power spectrum estimation prob- 
lem can be found in Section |4] and we provide a prescription for 
setting the many tuneable parameters of the sampler. In Section |5] 
we apply the method to low-resolution simulations and compare the 
Hamiltonian and Gibbs samplers. Section[6]details our application 
of the method to simulated WMAP observations. Our conclusions 
are presented in Section|7] 



2 POWER SPECTRUM ESTIMATION WITH SAMPLING 

Suppose the true CMB sky, divided for convenience into pixels, is 
represented by the temperature vector t. The sky is observed and 
the resultant data vector d, in any domain, is the sum d = s + nof 
contributions due to the underlying CMB signal s in that domain 
and the corresponding noise n. Moreover the signal s is usually 
linearly related to the true CMB sky t. Thus we have 



d=Rt + n, 



(1) 



where the matrix R represents the linear mapping from the true 
CMB sky to the corresponding CMB signal in whatever domain 
the data resides. 

In the following discussion, we need not assume a particular 
domain for the generic data vector d. Nevertheless, it is most com- 
mon for d to represent the pixelised CMB map convolved with the 
instrument beam and our work, so far, has used solely this form for 
the data vector. 

The temperature field t is related to the spherical harmonic 
coefficients of the field a by 



1 = 2 m=-l 



(2) 



where t (xp) is a single pixel in the map vector t and the Yt,n are 
the spherical harmonics. Although formally one may take the upper 
limit of the £ summation to be infinite, it is more typical to choose a 
finite value for ^max appropriate to the beam size. We have not con- 
sidered the effect of the mono- and dipole contribut ions, the han- 
dling o f which, within this framework, is discussed in lEriksen et al.l 
yOOJ). In this notation we may write our model for the data in the 
form 



d = RYa + n, 



(3) 



where Y describes the application of the spherical harmonic trans- 
form and we represent the spherical harmonic coefficients by a real 
vector. 

For an isotropic Gaussian CMB sky the covariance matrix C 
of the aim has components 



Ctmt'm' — {aemCLe'm') — Cl&U'&n 



(4) 



where the set of coefficients {Ci} constitute the theoretical angular 
power spectrum. Note that, since the sky is real, aim ~ a-t^-m. 

We aim to sample from the joint distribution of the power 
spectrum coefficients Ft {{Ci}\d). Although this is difficult to 
perform directly, it is possible to sample from the joint den- 
sity of the power spectrum coefficients and the signal realization 
Pr ({Cf }, a\d) and then marginalise over a. The joint density can 
be written as the product of the appropriate conditional distribu- 
tions 



The choice of prior Pr {{Ci}) is an interesting topic. lWandelt et all 
t;20o3) have some suggestions for making this choice but for the 
purpose of this work we set Pr {{Ci}) = 1 so that the maximum 
of our posterior will correspond directly to a maximum-likelihood 
estimate. 

Given our choice of prior and assuming the noise is Gaussian 
then the conditional distributions that make up ^ can be written in 
the form 



Pr (dja) oc exp 



i (d - RYa)"^ (d - RYa) 



where N = (nn^), and 

1 / 1 T -1 

Pr (a\(Ci})<x — exp a C a 

\/1Cl V 2 



(6) 



(V) 



where C is easily constructed using Q. It is convenient to rewrite 
this in the form 



Pr(a|{C4)« n (i-) ' exp(- 



21 + 1 ai 
2~C^ 



(8) 



is the power spectrum of the signal 



where ae = ^J^m \° 
realization. 

The selection of a domain in which to represent the data is 
determined by the requirement that N has a simple form. In this 
work we make the assumption that in the map domain N is well 
represented by a diagonal matrix. In this domain incomplete sky 
coverage is straightforwardly handled by setting the elements of 
N^^ that correspond to excluded pixels to zero. If the instrument 
beam is position invariant and circularly symmetric then we can 
compute the beam convolution quickly in harmonic space and the 
predicted noiseless data can be written in the form YBa where B 
represents the smoothing by the beam. 

The computational cost of evaluating the posterior (and its gra- 
dients) is now limited by the speed at which one can compute the 
spherical harmonic transform Y. The transforms scale as 0{N^(^) 
and can be efficiently parallelised. 

We draw samples from the joint space (a, {Ci}) using a 
Hamiltonian Monte-Carlo sampler described in Section[3] 



3 HAMILTONIAN MONTE CARLO 

Let us suppose that we wish to draw samples from a target den- 
sity Pr (x), where x is the TV-dimensional vector of our param- 
eters. Conventional MCMC methods move through the parameter 
space by a random walk and therefore require a prohibitive number 
of samples to explore -high dimensional spaces. The Hami ltonian 
Monte Carlo method dOuane et al.lll987l : lNeal|[l993lll996h draws 
parallels between sampling and classical dynamics. By exploiting 
techniques developed for describing the motion of particles in po- 
tentials it is possible to suppress random walk behaviour. Introduc- 
ing persistent motion of the chain through the parameter space al- 
lows HMC to maint ain a reasonab le efficiency even for high di- 
mensional problems jHansonlboOlh . 

For each parameter, Xi we introduce a 'momentum' pi and 
a 'mass' m;; we discuss how to set the mass in the Appendix. We 
construct a Hamiltonian formed from a potential energy term tp (x) 
and a kinetic energy term such that 



Pr {{Ce}, a\d) oc Pr (d|a) Pr {a\{Ce}) Pr {{Ci}) . 



(5) 



(9) 
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where our potential is related to the target density by 

ip ix) = — log Pr ix) . 



(10) 



Our new objective is to draw samples from a distribution that is 
proportional to exp {—H). The form of the Hamiltonian is such 
that this distribution is separable into a Gaussian in p and the target 
distribution, i.e. 



exp i-H) = Pr (x) Jl exp ( - ^ 



(11) 



We can then obtain samples from Pr (x) by marginalising over p. 

To find a new sample we first draw a set of momenta from 
the distribution defined by our kinetic energy term, i.e. an A'^ di- 
mensional uncorrelated Gaussian with a variance in dimension i 
of rrii. We then allow our system to evolve deterministically, from 
our starting point {x,p) in the phase space for some fixed time r 
according to Hamilton's equations, 

dxj _ dH 
dt dpi 

dpi dH dxp (x) 



(12) 



(13) 



At the end of this trajectory we have reached the point {x' ,p') and 
we accept this point with probability 

PA = min(l,exp(-(5_W)) , (14) 
where 

5H ^ H (x' ,p') - H{x,p) . (15) 

After a new proposed sample is generated the momentum variable 
is discarded and the process restarts by randomly drawing a new 
set of momenta as described above. 

This implies that if we are able to integrate Hamilton's equa- 
tions exactly then, as energy is conserved along such a trajectory, 
the probability of acceptance is unity. 

In fact the method is more general as, provided one uses the 
Metropolis acceptance criterion | |14I >. it is permitted to follow any 
trajectory to generate a new candidate point. However only trajec- 
tories that approximately conserve the value of the Hamiltonian |9} 
will result in high acceptance rates. For some problems it may be 
advantageous to generate trajectories using an approximate Hamil- 
tonian that can be computed rapidly, and bear the cost of lowering 
the acceptance probability. 

To integrate the equations of motions it is common practice to 
use the leapfrog method. This method has the property of exact re- 
versibility which is required to ensure the chain satisfies detailed 
balance. It is also numerically robust and allows for the simple 
propagation of errors. We make n steps with a finite step size e, 
such that ne = r, as follows. 



e 9i/) {x) 



2 ^x^ 



c(t) 



) = X^{t) + 



= Pi 



0+1) -I 



e 9?/; ix) 



Xi {t + e 
Pt [t + e 

until t = T. The interval r must be varied, usually by drawing n 
and e randomly from uniform distributions, to avoid resonant tra- 
jectories. Higher-order integration schemes are permitted, provided 
exact reversibility is maintained, although generally incur signifi- 
cant additional computational costs. 



QXi 



(16) 
(17) 
(18) 



3.1 Convergence tests 

Diagnosing the convergence of a chai n in an MCMC proces s 
is the subject of much literature (see ICowles & CarlinI (Il996h: 
iBro oks & RobertsI ( Il997h for comprehensive reviews). iHansonl 
(200lF provides a method that uses the gradient information, which 
we must possess to calculate trajectories in HMC, to compute a 
convergence criteria. 

One constructs two estimates of the variance of a chain, that 
depend quite differently upon the distribution of samples across the 
target density, although the basic method is easily generalised to 
(combinations of) higher order central moments of Pr (a;). When 
the two estimates agree to within a certain accuracy the chain is 
assumed to have converged. 

We compute the variance of each parameter Xi independently. 
Our first estimate of the variance of the samples is calculated by 



{Xi 



Xi) Pr (x) dx : 



(19) 



where k labels a sample in a chain of A/ samples and the integral 
extends over the entire x-space. For our second estimate we take 
the expression for the variance and integrate by parts 

{xi - Xif Pr (a;) da; 

1|. _v,3t-./ 
= -\[X^- X^) VT{X^)\ 

1 / _ ^dx, (20) 
3 J dxt 

the first term of which will vanish if the marginalized distribution 
Pr {xi) drops off faster than x'l as Xi tends to ±oo. Using l IlOl l we 
rewrite this expression as 



.^ = \r (x.^-x.f^-^Vrix)dx. 
6 / oxi 

We compute l l21t from the samples in our chain by 



MS^*^ ' > hx. 



(21) 



(22) 



To test for convergence we compute the ratio Ri of l ll9t and l l22t . 
and we believe the chain has converged when all the Ri are close 
to unity. 

We have tested how this criterion compares t o the widely used 
Gelman-Rubin statistic jGelman &Rubinlll99"2l) and have found 
that Hanson's method tends to be, if anything, slightly pessimistic. 
We find that values of R in the range 0.8 to 1.2 represent good 
convergence and values in the range 0.6 to 1.4 are acceptable. 
The Gelman-Rubin method requires multiple chains to be gener- 
ated and compares inter-chain with intra-chain statistics, whereas 
Hanson's test uses a single chain and compares two different intra- 
chain statistics. We use the Hanson test as it is very easy to com- 
pute, scales well with problem size and requires that we only gen- 
erate one chain. We plan to explore other intra-cha i n con vergence 
diagnostics such as that proposed bv lDunklev etai] ( l2005h . 



4 HAMILTONIAN MONTE CARLO AND POWER 
SPECTRUM ESTIMATION 

We use HMC to draw samples simultaneously from the joint 
density l[5}- Our potential is defined by ?/) (a, {CiY) = 
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- log Pr (a, {Ci}\d) such that 

iP (a, {Ci}) = \{d- YBa)"^ N"' (d - YBa) 



and the gradient of the potential can be computed exactly by 
9V(a,{C4) 



9a 

9^ (a,{C4) 
9C« 



YBa) + C a 



(23) 



(24) 



(25) 



The positivity requirement on the power spectrum Ci can re- 
sult in a high rejection rate and we have found it advantageous to 
reparametrize the problem in terms of the logarithm of the s. For 
this reparametrization it is easy to calculate the corresponding po- 
tential and its derivatives. To enforce a flat prior on each Ce we 
must apply an exponential prior on log Ci. 

To generate a new sample requires us to evaluate the gradient 
at each point along the leapfrog trajectory and to evaluate the value 
of the potential once at the end of the trajectory. Therefore, if we 
take n leapfrog steps, we must perform 2n + 1 spherical harmonic 
transforms, although we can reuse the gradient at the end of one 
trajectory for the first step of the next. 

We split the sampling process into a bum in phase, in which 
we attempt to lose any dependence on our starting point, and a sam- 
pling phase where we store the samples from the chain and we be- 
lieve these samples are drawn from the target density. During burn 
in we are permitted to adjust the parameters of the sampler, for ex- 
ample to tune the acceptance rate. Once bum in is complete we 
must fix the parameters of the sampler in order that our samples 
come from the desired distribution. 

A good starting point can significantly reduce the time re- 
quired for burn in. We have explored a number of possibilities for 
computing a starting point for the signal a given some initial guess 
for the power spectrum. One we have found particularly effective is 
to draw a single signal sample, as for one step of the Gibbs sampler, 
from the conditional distribution Pr (a|d, {C<}). T his is a compu- 
tatio nally expensive process and is described fully in Wandelt et al. 
( 12004 ): Eriksen et al. (2004). The basic procedure involves solving 
the following equation for a;, the spherical harmonic coefficients of 
the mean field (Wiener filtered) map, 

(C"' + BY^N^YB) X = BY^N^d (26) 

and a fluctuation term y that corrects for the bias in x 

(C"' + BY'^N^'YB) y = C"'/^Lj(, + BY'^N-'/^cJi, (27) 

where luq is a set of spherical harmonic coefficients and uji a map 
both containing Gaussian white noise of zero mean and unit vari- 
ance. The sum of x and y is our starting sample a. We solve 
for X + y using a conju gate gradient algorithm (see, for exam- 
ple lOolub & LoarJ ( Il996l) ). A preconditioner can be used to reduce 
the number of iterations required for the convergence of the conju- 
gate gradient solver, however the construction of a preconditioner 
is itself a complex procedure, and since we only perform this step 
once and the accuracy of the result is of little consequence, we have 
not made use of one in this work. Whether or not applying the con- 
jugate gradient algorithm without a preconditioner is feasible de- 
pends on the nature of the data set under consideration. 

HMC has a large number of adjustable parameters, notably the 
masses. The distribution for the a parameters is Gaussian and so we 
attempt to set the mass associated with each agrn such that they are 



inversely proportional to the variance of that aem- We justify this 
choice in the Appendix. The masses for the a are estimated for a 
fixed power spectrum for which the variance is computed by 



(28) 



where we use our initial estimate of the power spectrum as the value 
of Ce and compute the diagonal elements of the inverse noise co- 
variance matrix in harmonic space using Monte Carlo simulations. 

At high £ and with good signal-to-noise the marginal distribu- 
tions for each Ce are close to Gaussian and we can obtain masses 
from the standard expressi on for the variance (see, for example 



trom me stanaarg expressi on 
IZaldarriaga & Selialj lT997h ) 

var(C<) = (Ce + Ne/B^? , 



(29) 



sky 



where A'^^ is the power spectrum of the noise in the daX&^Bi is the 
beam transfer function and /sky is the fraction of the sky observed. 
For low multipoles the distributions are significantly skewed and in 
low signal-to-noise the sharp cut off of the distribution at = 
has a similar effect. In these cases we have found that setting the 
masses from the variances is insufficient. Instead we tune these 
masses empirically. We aim to set the mass for each parameter to 
as small a value as possible while maintaining our target accep- 
tance rate. We sample the C^s from simple approximate likelihood 
function and gradually reduce the masses until the acceptance rate 
drops. This gives masses that are sufficient for sampling the full 
problem efficiently. 

During bum in we can further tune the masses; the conver- 
gence criterion for each parameter providing a good indication of 
whether or not the mass associated with that parameter is set cor- 
rectly. 

We must randomise the length of each trajectory and have 
found that drawing n from a uniform distribution between 10 and 
20 is appropriate. Therefore we typically require the application of 
~ 30 spherical harmonic transforms to generate a new proposed 
sample. We then tune the step size e such that we obtain an accep- 
tance rate between 70 and 90 per cent. A higher acceptance rate 
is used for HMC than other MCMC methods as the computational 
cost of a rejection is so high. 

Once sampling we store each {Cf } sample and the realization 
power spectrum {at} of each signal sample. The {op} can be used 
to form the Blackwell-Rao estimator of the posterior distribution 
dChu et al. 2005). The posterior can be written 

Pr({Ca|d) = / Pr({C,},a|d)da 



Pr({C£}|a)Pr(a|d)da, (30) 
which for a Gaussian CMB can be written as 
Pr({(7a|d) = j ¥r{{Ct}\{o,})¥r{{ai}\d)A{o,}, (31) 
where 



21+1 

p.((c,>iw)^n;i(g) ' «■>(- 



2 Ct 



(32) 



We can therefore compute the posterior probability of a set of {Cf } 
from M samples {cr^} by 



Pr({C,}|'i)^^^Pr({C,}|W}) 



(33) 
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It is also possible to construct the marginal distributions for any Ci 
or subset of {Ci}. For a single Ct the marginal distribution can be 
approximated by 



Pr(C,|d) 



where 



M 51 



Pr (CiWi) 



21+1 



2 Ce 



(34) 



(35) 



Extremely large numbers of samples would be needed to make this 
estimator accurate at high £. However even with a relatively small 
number of samples it forms a useful tool for the analysis of large 
angular scales. It is worth noting that the expression l l33b . or its 
one-dimensional marginalized version i34l . do not depend on the 
{Cfj-samples, but only on the realization power spectra {ae} of 
the a-samples. 



5 ANALYSIS OF LOW-RESOLUTION SIMULATIONS 

To compare the Hamiltonian and Gibbs samplers we applied them 
both to a set of low-resolution simulations. We produced a map 
of the CMB with a HEALPi]^] iVsido = 32 (12288 pixels). Our 
CMB simulation is a realization of a ACDM cosmology with 
the best fitting para meters from the 5-year WMAP observation^ 
jSpergel et al. [2007) and includes multipoles up to ^ = 64. We 



smoothed the map with a 3-degree Gaussian beam and added 
isotropic noise with an RMS amplitude of 55fiK per pixel. We 
chose the noise level so that we could explore how the sampler be- 
haved as a function of the signal-to-noise ratio. We degraded the 
WMAP Kp2 mask such that any (large) pixel in our final mask is 
excluded if any of the (small) subpixels in the original mask are 
excluded. This has the effect of enlarging the Kp2 mask to remove 
around 30 percent of the sky: a large contiguous area along the 
Galactic plane and a number of small regions around the locations 
of bright point sources. 

For each sampler we take 20000 burn in samples and then 
record the next 50000 samples. A large number of samples helps 
to estimate correlation lengths accurately; far fewer samples are re- 
quired to explore the distribution. The marginal distributions of a 
selection of the Ce are shown in Fig.[T] For most £ the data is too 
noisy to constrain the value of the d however we do see good 
agreement between the results from the Gibbs and Hamiltonian 
samplers. The HMC samples have also been used in conjunction 
with the Blackwell-Rao estimator to generate a smooth approxima- 
tion to the marginal distributions. This estimator appears to agree 
well with the histograms across this range of £, but more samples 
are likely to be needed if we were to calculate the joint distribution 
of the {Ce}. 

In order to characterise the performance and efficiency of the 
samplers we considered the correlation of the {Ce} samples. As- 
suming that the Ces are independent we can examine the auto- 
correlation function, 



C{n) 



/ CI - {Ci) Cl'^ 



\ v/Var(Cf ) ^Ws.r{Ce) 



(36) 




10 10 
signal to noise ratio 



Figure 3. The correlation length j37t as a function of the signal-to-noise 
ratio of each Ce pai ameter. The red points show the results from the Gibbs 
sampler the blue points those from the Hamiltonian sampler. 



We show the auto-correlation function for a selection of multipoles 
in Fig.|2] As the signal-to-noise ratio for a single I, defined as the 
ratio of the signal and noise power spectra at that I, decreases with 
increasing £ the samples become more highly correlated; it takes 
more steps of the samplers to generate independent samples. This 
feature is a well known limitation of the Gibbs sampler caused by 
the fact that drawing the power spectrum from the conditional dis- 
tribution Pr ({Cf}|d, a) is limited to the size of the cosmic vari- 
ance while the joint distribution may be much wider. Similar be- 
haviour is observed with the Hamiltonian sampler although the 
cause is now related to the difficulty in sampling the highly skewed 
distributions that occur when the signal-to-noise ratio is low. The 
correlation length for each parameter can be estimated using 



I ■ 



1 + 2 £ C(r 



(37) 



^ [http://healpix.jpl.nas a. gov| 

^ http://lambda.gsfc.nasa.gov/product/map/dr3/parameters.cfm 



where we truncate the summation at some maximum lag rimax at 
which the auto-correlation function becomes noisy. Fig. |3] shows 
how the measured correlation lengths for the power spectrum pa- 
rameters from the Gibbs and Hamiltonian samplers depend on the 
signal-to-noise ratio for each parameter, again estimated assuming 
the parameters are independent. We see that in the high signal- 
to-noise regime the Gibbs sampler performs exceptionally well 
whereas the Hamiltonian sampler produces samples with typical 
correlation lengths of around four steps. Once the data becomes 
noise dominated the picture is less clear with the Hamiltonian sam- 
pler generally performing marginally better than the Gibbs sampler. 
As the signal-to-noise ratio drops below about 0.01 both samplers 
perform poorly. 

It is worth noting that Hamiltonian sampler requires around 
an order of magnitude fewer spherical harmonic transforms (the 
computationally intensive step in the process) per sample than a 
Gibbs sampler that uses no preconditioner and around a factor of 
3-4 fewer transforms than is reported for Gibbs samplers with care- 
fully tuned preconditioners (Eriksenet al. 2004). Furthermore we 
have found that the correlation lengths of the Hamiltonian sampler 
strongly depend on the masses one uses, offering the opportunity 
for significant improvements given a more sophisticated prescrip- 
tion for setting the masses. 
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Figure 1. Marginal distributions of Ct samples for a selection of £ as shown in the top right comer of each plot along with the signal to noise ratio for this 
multipole. The results for the Gibbs sampler are shown in red and the Hamiltonian sampler in blue. The plots show the logarithm of the number of samples 
falling in each bin. The dashed vertical line shows the theoretical value of the C< used in creating the simulation whereas the dotted vertical line shows the 
value for the realization. The marginal distributions from the Blackwell-Rao estimator applied to the HMC samples are shown by the smooth black line. 




Figure 2. The auto-correlation functions i36\ of d samples for a selection of £ as shown in the top right comer of each plot along with the signal to noise 
ratio at this multipole. The results for the Gibbs sampler are plotted in red and Hamiltonian sampler in blue. All the plots use the same scale as shown in the 
bottom left plot. 
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Figure 4. Binned power spectram and 68 percent confidence intervals as 
compared to the results of an application of the MASTER method to the 
same simulated WMAP data. The black solid line shows the power spec- 
trum from which the simulation was generated while the grey shows the 
power spectrum of the realization. The grey squares and error bars show the 
MASTER results. The black circles and error bars show the peak and 68 
per cent confidence intervals found from samples generated with the HMC 
sampler. 

6 ANALYSIS OF SIMULATED WMAP DATA 

We produce a CMB simulation as for Section|5]but with A'sidc = 
512 (~ 3 X 10® pixels) and including multipoles up to ^ — 512. 
The map was then smoothed with a 13-arcmin Gaussian beam, 
which is similar in size to the beam of the WMAP W-band. We then 
added anisotropic uncorrelated noise by making use of the pub- 
lishecffl A'obs and noise variance for the 5-year WMAP combined 
W band map. The map was cut with the Kp2 mask which excludes 
15.3% of the sky. We included multipoles up to ^max = 512 in our 
analysis. This gives us a total of around 2 x 10^ parameters in our 
sampling space. 

To generate a good signal starting point, using a single Gibbs 
sample, required ~ 800 iterations (20 minutes on the hardware 
described below) of the conjugate gradient to solve ( 126b and Ml\ 
such that the rms residual was less than 10~®. 

For these simulations we made a total of 5000 burn in samples 
and recorded 10000 samples from the post bum-in phase. It takes 
~ 20 seconds to generate a single sample using two dual core Intel 
Xeon 5150 processors and the MPI parallelised HEALPix spherical 
harmonic transforms, resulting in a total processing time of around 
80 hours. 

For compari son we applied the MASTER method 
jHivon et al.l I2OO2I) to the same data set. Our peak likelihood 
Ci sample and 68 per cent confidence intervals, binned with the 
WMAP team's scheme, are shown alongside the results of the 
MASTER method in Fig. |4] For most of the range of angular 
scales the two estimates and their errors agree well. On the largest 
angular scales the MASTER estimate tends to underestimate the 
uncertainities and the symmetric errors are far from representative 
of the posterior. In Fig.|5]we show a summary of the convergence 
statistics, using Hanson's diagnostic, see Section [3T1 demonstrat- 
ing that we have fully explored the distribution across the entire 

^ http://lambda.gsfc.nasa.gov 
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Figure 5. A summary of the convergence statistics of the 10000 samples 
used to produce the power spectrum in Fig. |4] Although convergence is 
judged from the R for every parameter we show here only the average R 
for in each bin for the Ci (blue line) and a (red line). R . 



range in I. For all multipoles the R value is within the range 0.9 to 
1.1. 



7 CONCLUSIONS 

We have introduced the HMC sampler for CMB power spec- 
trum estimation and demonstrated its performance both on low- 
resolution simulations and simulations of 5-year WMAP data. We 
find that the Hamiltonian sampler has similar or shorter correlation 
lengths when compared to the Gibbs sampler except in the regions 
of the highest signal-to-noise. Bearing in mind the reduced compu- 
tational cost and greater flexibility of the Hamiltonian sampler we 
believe it is an attractive method for performing the analysis. 

For high-resolution data sets of size (A'^sidc = 512, ^max = 
512) we can generate a sample in ^ 20 seconds on a high-end 
desktop. This is a significant gain over the reported performance of 
Gibbs samplers. 

HMC requires that we are able to compute the logarithm of 
the target density and its gradients. Even if exact gradients are 
not available we can generate approximate trajectories and these 
will still result in samples drawn from the required distribution. 
The generality of the approach removes the requirement for strictly 
Gaussian signal and noise and therefore promises to be an interest- 
ing method for tackling a wide range of related problems. 

We are currently testing the performance of the method on 
high-resolution Planck simulations and working on extending the 
method to include polarization. We also intend to apply the tech- 
nique to the WMAP data. 
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APPENDIX A: MASSES FOR HAMILTONIAN MONTE 
CARLO 

Hamiltonian Monte Carlo can be extremely sensitive to the choice 
of masses. When sampling from an approximately isotropic distri- 
bution this does not affect the performance significantly but when 
the marginal distributions of different parameters show consider- 
able variation in width the masses must be set correctly to sample 
effic iently. 

iHansonl jZOOlh suggests that one should set the mass associ- 
ated with each parameter to be approximately equal to the variance 
of that parameter in the target density. This is an attempt to circu- 
larise the trajectories in the {x,p} space. We take an alternative 
approach, where the mass for a parameter is in versely prop ortional 
to the width of the distribution, as suggested in lNeallfl996 ). In or- 
der to justif y this approach we have generalised the framework in 
iNeall 1 I1993I) to describe the application of the leapfrog method. 

Consider the problem of sampling from an n-dimensional 
Gaussian distribution in x with covariance matrix C. Our Hamilto- 
nian is quadratic in x and p 



H 



(Al) 



2 2 

where M is a n x n mass matrix, and the trajectory will be deter- 
mined by Hamilton's equations 

dx 
'dt 
dp 
lit 

We integrate the equation of motion with the leapfrog method 



M-^p 



(A2) 



(A3) 



p{t + e/2)=p{t)-^-C-^x[t) 
x{t + e) = x{t) + eW^p {t + e/2) 
p{t + e)=p{t + e/2) - |c~'a; (t + e) 



(A4) 
(A5) 
(A6) 



A single application of the leapfrog method can be written in the 
form 



: + e) = ( I - y M^'C"' ]x(t) + eM-^p (t) (A7) 



p{t + e)= -eC"' (l-4M''C-^)a;(t) + 

+ (l^4c-iM-i)p(t), (A8) 
where I is the identity matrix. We can rewrite this in a matrix form 

(A9) 



" x(t + €) ' 


= T 


' x{t)' 


_ P (i + e) 




Pit) 



where 



i-4m^c~i 



.(AlO) 



If the method is to be stable under the repeated application of 
T then we require its eigenvalues to have unit modulus. The eigen- 
values A are found from the characteristic equation 



det 



lA-^ - 2A I 



-M-'C 



+ 1 



: 0. 



(All) 



To explore the space rapidly we wish to find the largest e com- 
patible with the condition for stability. Any dependence of dAl lb 
on C implies no single value for e will meet the requirement for ev- 
ery eigenvalue to have unit modulus (unless both C and M are pro- 
portional to the identity matrix). The maximum value for e should 
therefore be controlled by the width of the distribution for a small 
subset of parameters. 

By setting M — C^^ we remove the dependence of e on the 
size of the distribution. In this situation the characteristic equation 
reduces to 



A^ - 2A f 1 - ^ I + 1 



: 



(A12) 



and the stability criterion is met by e ^ 2. 

If the dimensionally of the problem is such that it is impracti- 
cal to perform the required matrix inversion and decomposition of 
M (to compute the Hamiltonian and to draw new values for the mo- 
mentum variables respectively) then simple approximations must 
be employed. Typically one might construct a diagonal mass ma- 
trix with the mass associated with each parameter inversely propor- 
tional to the variance of that parameter. 

If the distribution to be sampled from is not Gaussian it seems 
reasonable to use some appropriate meas ure of the w idth of the 
distribution (i.e. the curvature at the peak l lNea]|[l996l) ) to set the 
masses. 
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